Urban Biodiversity Atlas & Habitat Connectivity

Authored by: Vamshi Krishna Y R
Duration: 90 mins
Level: Intermediate
Pre-requisite Skills: Python, Pandas, Matplotlib, Folium, Geopandas, Plotly, Scikit-learn

Use case scenario

As a: City of Melbourne urban planner, environmental advocate, or community member interested in enhancing urban biodiversity and ecological connectivity.

I want to: Identify biodiversity cold-spots, map habitat gaps, and recommend targeted planting or habitat interventions to reconnect fragmented green spaces.

So that I can: Support evidence-based greening initiatives, foster community stewardship, and improve the ecological health and liveability of Melbourne’s urban environment.

By:

  • Integrating open datasets on trees, canopy cover, invertebrates, open spaces, and barriers.
  • Computing insect biodiversity metrics and identifying ecological hot-spots and cold-spots.
  • Modelling habitat connectivity and recommending priority actions for planting or habitat structures.
  • Publishing actionable insights through interactive dashboards and open APIs.

What this use case will teach you

  • How to ingest, clean, and spatially analyse diverse urban ecological datasets using Python and geospatial libraries.
  • Techniques for quantifying species richness, diversity, and identifying ecological clusters in an urban context.
  • Approaches to model habitat connectivity and prioritise interventions using spatial analysis and predictive modelling.
  • Best practices for communicating actionable insights through interactive dashboards and open data APIs.
  • The societal and environmental impact of data-driven urban greening strategies.

Objectives

  • Map current biodiversity and diagnose habitat gaps across the City of Melbourne.
  • Prioritise micro-locations for targeted planting or habitat structures to enhance species connectivity.
  • Develop and publish an interactive Urban Biodiversity Atlas dashboard and open API.
  • Empower council and community stakeholders with actionable, evidence-based recommendations for urban greening.

Initialisation

Importing necessary libraries

In [32]:
import warnings
warnings.filterwarnings("ignore")
# Enable inline plotting in Jupyter notebooks
%matplotlib inline

# Data handling
import pandas as pd                   # Data manipulation (e.g. reading CSVs, merging sensor & API data)
import numpy as np                    # Numerical operations (e.g. statistics, array maths)
import math                           # Maths functions (e.g. sqrt for computing buffer radii)
import json                           # Parse JSON strings (e.g. GeoJSON geometries in geo_shape, API responses)
from io import StringIO               # In-memory text I/O (e.g. loading CSV data from a string)
import re                            # Regular expressions for text processing (e.g. extracting genus from notes)

# Geospatial processing
import geopandas as gpd               # GeoDataFrames for shapefiles & GeoJSON
from shapely.geometry import Point, shape  # Create/manipulate geometric objects (sensor points, canopy polygons)
from geopy.distance import geodesic   # Calculate geodesic (ellipsoidal) distances between lat/lon points (e.g. 50 m radius checks)

# Static & interactive mapping
import contextily as ctx              # Basemap tiles for GeoPandas plots (e.g. OSM background)
import folium                         # Interactive leaflet maps in Jupyter (e.g. pan/zoom sensor coverage)
from folium.plugins import MarkerCluster  # Cluster nearby markers on interactive maps
from branca.element import Template, MacroElement # Overlay Legend on Folium Maps

# Visualisation
import matplotlib.pyplot as plt       # Static charts (e.g. bar plots, heatmaps)
import seaborn as sns                 # Statistical viz (e.g. correlation matrix heatmap)
import plotly.express as px           # Interactive plots (e.g. time-series of PM₂.₅)

# HTTP requests with caching & retries
import requests                       # API calls (e.g. fetch tree-canopy GeoJSON)
import requests_cache                 # Cache API responses (avoid repeated rate limits)
from retry_requests import retry      # Retry logic (e.g. for transient network errors)
import openmeteo_requests             # Client for Open Meteo weather & air-pollution API

# Notebook display helpers
from IPython.display import IFrame    # Embed HTML (e.g. folium maps) directly in cells

# Utility data structures
from collections import defaultdict   # Default dictionaries (e.g. grouping counts by schedule)

# Machine learning pipeline
from sklearn.pipeline import Pipeline           # Chain preprocessing & model steps
from sklearn.impute import SimpleImputer       # Handle missing values (e.g. fill NaNs in PM₂.₅)
from sklearn.preprocessing import OneHotEncoder# Encode categorical features (e.g. month → dummy vars)
from sklearn.ensemble import RandomForestRegressor  # Bagging-based ensemble regressor
from xgboost import XGBRegressor               # Gradient-boosting regressor
from sklearn.model_selection import RandomizedSearchCV, GroupKFold  # Hyperparameter search & grouped CV
from sklearn.metrics import mean_squared_error, r2_score  # Evaluation metrics (e.g. RMSE, R²)
import joblib                                  # Save/load trained models (e.g. persist best model)

from geopy.geocoders import Nominatim           # Geocoding (e.g. convert addresses to coordinates)
from geopy.extra.rate_limiter import RateLimiter # Geocoding rate limiter (e.g. avoid exceeding API limits)

Importing the data through the API of the City of Melbourne open data portal

The function below retrieves a dataset from the City of Melbourne open data portal in CSV format. Given a dataset identifier and, where required, an API key, it sends a request to the portal's export endpoint and parses the response into a pandas DataFrame. The same call is used for every dataset in this use case (insect records, tree canopies, urban forest trees and building footprints), which keeps data access and integration consistent.

In [2]:
def import_data(dataset_id, api_key=None):
    """
    Imports a dataset from the City of Melbourne Open Data API.

    Parameters:
    - dataset_id (str): The unique dataset identifier.
    - api_key (str, optional): API key, sent with the request only if provided.
    Returns:
    - pd.DataFrame: The imported dataset as a pandas DataFrame.
    """

    base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
    export_format = 'csv'

    url = f'{base_url}{dataset_id}/exports/{export_format}'
    params = {
        'select': '*',
        'limit': -1,  # all records
        'lang': 'en',
        'timezone': 'UTC'
    }
    if api_key:
        params['apikey'] = api_key  # optional: the calls in this notebook run without a key

    # GET request
    response = requests.get(url, params=params)

    # Fail loudly so that callers never receive None instead of a DataFrame
    if response.status_code != 200:
        raise RuntimeError(f'Request for {dataset_id} failed with status code {response.status_code}')

    # The export is semicolon-delimited; read it from an in-memory buffer
    url_content = response.content.decode('utf-8')
    df = pd.read_csv(StringIO(url_content), delimiter=';')
    print(f' Imported the {dataset_id} dataset with {len(df)} records successfully \n')
    return df
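
The import function reads the export with a semicolon delimiter. A minimal offline sketch of why that matters is given here: fields such as geo_point_2d contain a comma themselves, so the separator has to be passed explicitly. The helper names build_export_url and parse_export_csv and the two-row payload are hypothetical and only illustrate the pattern; they are not part of the portal's API.

```python
from io import StringIO

import pandas as pd

BASE_URL = "https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/"


def build_export_url(dataset_id: str, export_format: str = "csv") -> str:
    """Return the export endpoint for a dataset (same URL pattern as import_data)."""
    return f"{BASE_URL}{dataset_id}/exports/{export_format}"


def parse_export_csv(text: str) -> pd.DataFrame:
    """Parse a semicolon-delimited CSV export into a DataFrame."""
    return pd.read_csv(StringIO(text), delimiter=";")


# Hypothetical two-row payload shaped like the tree-canopy export.
sample = (
    "geo_point_2d;geo_shape\n"
    '"-37.8298, 144.9830";"{""coordinates"": []}"\n'
    '"-37.8299, 144.9714";"{""coordinates"": []}"\n'
)

url = build_export_url("tree-canopies-2021-urban-forest")
df_sample = parse_export_csv(sample)
print(url)
print(df_sample.shape)
```

Because the comma inside geo_point_2d is not treated as a separator, each coordinate pair stays in a single column and can be split later during preprocessing.
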

Datasets

Melbourne’s liveability depends not only on its built form but also on the ecological health of its parks, streets and private gardens. Insects, birds and bats pollinate plants, recycle nutrients and form the base of urban food-webs, yet their habitat is fragmented by roads and dense development. In line with Chameleon’s mission to “enhance life through the application of smart-city technologies” and the goal of showcasing practical applications of City-of-Melbourne (CoM) open data, this use case will build an Urban Biodiversity Atlas that quantifies species richness, pin-points ecological “cold-spots” and recommends green corridors that reconnect them. Insights will guide both council planting programmes and community-led actions such as pollinator gardens and nesting-box installations.

Below are the primary datasets used:

Theme | Dataset | Source | Key fields / notes
--- | --- | --- | ---
Flora structure | Trees with Species & Dimensions (Urban Forest), ~70,000 street & park trees | Melbourne Open Data | Species, DBH, life-stage, health, location
Canopy extent | Tree Canopies 2021 (Urban Forest) | Melbourne Open Data | High-resolution canopy polygons
Invertebrates | Insect Records, "Little Things that Run the City" | Melbourne Open Data | Insect species, abundance, sampling site
Barriers | 2020 Building Footprints | Melbourne Open Data | Footprint polygons for least-cost analysis

Importing dataset - insect-records-in-the-city-of-melbourne-from-little-things-that-run-the-city

About the dataset: This dataset contains detailed insect records from "The Little Things that Run the City" project - a critical resource for understanding urban biodiversity patterns across Melbourne. The collection includes identified insect species found across various parks and gardens, providing baseline data for mapping biodiversity hotspots and analysing habitat connectivity. Field surveys were conducted between October 2014 and March 2015, with species identification completed between April and September 2015. Understanding insect diversity is fundamental to developing targeted habitat interventions and measuring the ecological health of Melbourne's urban environment.

In [3]:
# Importing insect records dataset
insect_records = 'insect-records-in-the-city-of-melbourne-from-little-things-that-run-the-city' 
df_insect_records = import_data(insect_records)
df_insect_records.to_csv('df_insect_records.csv', index=False) # saving into a local file
df_insect_records_orig = df_insect_records.copy()  # keep an independent copy of the original dataset
print('First few rows of the dataset:\n')
df_insect_records.head()
 Imported the insect-records-in-the-city-of-melbourne-from-little-things-that-run-the-city dataset with 1295 records successfully 

First few rows of the dataset:

Out[3]:
taxa kingdom phylum class order family genus species identification_notes location sighting_date
0 Insect ANIMALIA ARTHROPODA INSECTA HYMENOPTERA PTEROMALIDAE NaN NaN Pteromalidae 4 Fitzroy-Treasury Gardens NaN
1 Insect ANIMALIA ARTHROPODA INSECTA DIPTERA PYRGOTIDAE NaN NaN Pyrgotidae 1 Royal Park NaN
2 Insect ANIMALIA ARTHROPODA INSECTA DIPTERA SCENOPINIDAE NaN NaN Scenopinidae 2 Royal Park NaN
3 Insect ANIMALIA ARTHROPODA INSECTA DIPTERA SEPSIDAE NaN NaN Sepsidae 1 Princes Park NaN
4 Insect ANIMALIA ARTHROPODA INSECTA DIPTERA STRATIOMYIDAE NaN NaN Stratiomyidae 2 Lincoln Square NaN

Importing dataset - tree-canopies-2021-urban-forest

About the dataset: The Tree Canopies 2021 - Urban Forest dataset maps the extent of tree canopy cover across the City of Melbourne using aerial imagery and LiDAR data. It provides detailed spatial insights into urban forest coverage, supporting initiatives in climate resilience, biodiversity, and urban planning.

In [4]:
# Importing tree canopy dataset
tree_canopy_2021 = 'tree-canopies-2021-urban-forest' 
df_tree_canopy_2021 = import_data(tree_canopy_2021)
df_tree_canopy_2021.to_csv('df_tree_canopy_2021.csv', index=False) # saving into a local file
df_tree_canopy_2021_orig = df_tree_canopy_2021.copy()  # keep an independent copy of the original dataset
print('First few rows of the dataset:\n')
df_tree_canopy_2021.head(5)
 Imported the tree-canopies-2021-urban-forest dataset with 57980 records successfully 

First few rows of the dataset:

Out[4]:
geo_point_2d geo_shape
0 -37.8298681237421, 144.98303001088595 {"coordinates": [[[[144.9832974445821, -37.829...
1 -37.829874533279096, 144.97144661356745 {"coordinates": [[[[144.9714529379414, -37.829...
2 -37.83021396760069, 144.98646678135142 {"coordinates": [[[[144.98647926050035, -37.83...
3 -37.828742240015515, 144.9011718210025 {"coordinates": [[[[144.90116683929529, -37.82...
4 -37.829920930428415, 144.96518349051888 {"coordinates": [[[[144.96517363556384, -37.82...
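
Each geo_shape value holds the canopy outline as a JSON string. As a rough illustration of how such a string can be turned into a number, the sketch below parses one geometry and approximates its area in square metres with a local equirectangular projection and the shoelace formula. It assumes every value is a GeoJSON Polygon or MultiPolygon with a "type" key (the preview above is truncated, so this is not confirmed), and the functions ring_area_m2 and canopy_area_m2 and the sample square are hypothetical.

```python
import json
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres


def ring_area_m2(ring):
    """Approximate area of one closed lon/lat ring (shoelace formula on a
    local equirectangular projection)."""
    lat0 = math.radians(sum(lat for _, lat in ring) / len(ring))
    pts = [
        (EARTH_RADIUS_M * math.radians(lon) * math.cos(lat0),
         EARTH_RADIUS_M * math.radians(lat))
        for lon, lat in ring
    ]
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1])
    )
    return abs(twice_area) / 2.0


def canopy_area_m2(geo_shape: str) -> float:
    """Sum outer-ring areas minus hole areas for a Polygon or MultiPolygon."""
    geom = json.loads(geo_shape)
    if geom.get("type") == "MultiPolygon":
        polygons = geom["coordinates"]
    else:
        polygons = [geom["coordinates"]]
    total = 0.0
    for polygon in polygons:
        outer, *holes = polygon
        total += ring_area_m2(outer) - sum(ring_area_m2(h) for h in holes)
    return total


# Hypothetical canopy: a square of 0.0001 degrees per side near the CBD.
sample_shape = json.dumps({
    "type": "MultiPolygon",
    "coordinates": [[[
        [144.9830, -37.8300],
        [144.9831, -37.8300],
        [144.9831, -37.8299],
        [144.9830, -37.8299],
        [144.9830, -37.8300],
    ]]],
})
print(round(canopy_area_m2(sample_shape), 1))
```

This is only a quick sanity check for small polygons. For the analysis proper, loading the geometries into a GeoDataFrame and reprojecting to a metric coordinate reference system is likely the more robust route.
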

Importing dataset - trees-with-species-and-dimensions-urban-forest

About the dataset: This dataset details the location, species and lifespan of Melbourne's urban forest by precinct. The City of Melbourne maintains more than 70,000 trees.

In [5]:
# Importing urban forest dataset
urban_forest = 'trees-with-species-and-dimensions-urban-forest' 
df_urban_forest = import_data(urban_forest)
df_urban_forest.to_csv('df_urban_forest.csv', index=False) # saving into a local file
df_urban_forest_orig = df_urban_forest.copy()  # keep an independent copy of the original dataset
print('First few rows of the dataset:\n')
df_urban_forest.head(5)
 Imported the trees-with-species-and-dimensions-urban-forest dataset with 76928 records successfully 

First few rows of the dataset:

Out[5]:
com_id common_name scientific_name genus family diameter_breast_height year_planted date_planted age_description useful_life_expectency useful_life_expectency_value precinct located_in uploaddate coordinatelocation latitude longitude easting northing geolocation
0 1049657 Unknown Melaleuca parvistaminea Melaleuca Myrtaceae NaN 1998 1998-12-17 NaN NaN NaN NaN Park 2021-01-10 -37.79070542406818, 144.94466634984954 -37.790705 144.944666 319025.79 5815416.68 -37.79070542406818, 144.94466634984954
1 1782373 Coastal Banksia Banksia integrifolia Banksia Proteaceae NaN 2020 2020-03-04 NaN NaN NaN NaN Park 2021-01-10 -37.802899143753464, 144.92619307686192 -37.802899 144.926193 317429.05 5814027.65 -37.802899143753464, 144.92619307686192
2 1604511 Red Box Eucalyptus polyanthemos Eucalyptus Myrtaceae NaN 2015 2015-05-08 NaN NaN NaN NaN Park 2021-01-10 -37.79572286091489, 144.9693861369436 -37.795723 144.969386 321214.72 5814907.50 -37.79572286091489, 144.9693861369436
3 1070399 Ironbark Eucalyptus sideroxylon Eucalyptus Myrtaceae 12.0 2006 2006-12-19 Semi-Mature 31-60 years 60.0 NaN Street 2021-01-10 -37.82793397289453, 144.90197974533947 -37.827934 144.901980 315359.55 5811202.01 -37.82793397289453, 144.90197974533947
4 1734680 Drooping sheoak Allocasuarina verticillata Allocasuarina Casuarinaceae NaN 2018 2018-09-05 NaN NaN NaN NaN Park 2021-01-10 -37.792723710257945, 144.94819168988934 -37.792724 144.948192 319341.15 5815199.54 -37.792723710257945, 144.94819168988934

Importing dataset - 2020-building-footprints

About the dataset: This dataset shows the footprints of all structures within the City of Melbourne. A building footprint is a 2D polygon (or multi-polygon) representation of the base of a building or structure. The footprint is defined as the boundary of the structure where the walls intersect with the ground plane or podium, rather than an outline of the roof area (roofprint).

In [6]:
building_footprint = '2020-building-footprints' 
df_building_footprint = import_data(building_footprint)
df_building_footprint.to_csv('df_building_footprint.csv', index=False) # saving into a local file
df_building_footprint_orig = df_building_footprint.copy()  # keep an independent copy of the original dataset
print('First few rows of the dataset:\n')
df_building_footprint.head(5)
 Imported the 2020-building-footprints dataset with 37750 records successfully 

First few rows of the dataset:

Out[6]:
geo_point_2d geo_shape footprint_type tier structure_max_elevation footprint_max_elevation structure_min_elevation property_id structure_id footprint_extrusion footprint_min_elevation structure_extrusion roof_type
0 -37.80037045080206, 144.9464370995547 {"coordinates": [[[[144.94650487159868, -37.80... Structure 1 24.5 24.5 15.0 107105.0 818620 9.5 15.0 9.5 Flat
1 -37.80034679637949, 144.94754591875628 {"coordinates": [[[[144.94761503800615, -37.80... Structure 1 23.5 23.5 13.0 107102.0 805966 10.5 13.0 10.5 Flat
2 -37.80027329346989, 144.94824324805376 {"coordinates": [[[[144.94834088635758, -37.80... Structure 1 24.0 24.0 13.0 107100.0 813265 10.5 13.0 11.0 Flat
3 -37.800556970621834, 144.94811576128058 {"coordinates": [[[[144.94818490989783, -37.80... Structure 1 25.5 25.5 16.0 107100.0 813267 9.0 16.0 9.5 Flat
4 -37.80167146256072, 144.94443776463496 {"coordinates": [[[[144.9444642519418, -37.801... Structure 2 21.0 21.0 14.5 105780.0 804769 7.0 14.5 6.5 Flat

Data Cleansing and Preprocessing

The data cleansing and preprocessing phase prepares the tree canopy, insect record, urban forest and building footprint datasets for analysis. This involves resolving inconsistencies, handling missing entries, and reformatting data where needed, for example separating latitude and longitude fields, removing redundant columns, and giving the datasets a consistent structure. These steps harmonise the datasets so they can be integrated and analysed together, and standardising and validating the data improves the accuracy and reliability of any insights derived.

In [7]:
def split_geo_coordinates(df, geo_column):
    """
    Splits a combined latitude,longitude column into two separate float columns: 'latitude' and 'longitude'.
    
    Parameters:
    - df (pd.DataFrame): The input DataFrame containing the geo column.
    - geo_column (str): The name of the column with 'latitude,longitude' string values.

    Returns:
    - pd.DataFrame: A new DataFrame with separate 'latitude' and 'longitude' columns.
    """
    if geo_column not in df.columns:
        raise ValueError(f"Column '{geo_column}' not found in DataFrame.")

    try:
        # Work on a copy so that the caller's DataFrame is left unchanged
        df = df.copy()

        # Ensure the geo_column is of string type
        df[geo_column] = df[geo_column].astype(str)

        # Attempt to split the column
        split_data = df[geo_column].str.split(',', expand=True)

        if split_data.shape[1] != 2:
            raise ValueError(f"Column '{geo_column}' does not contain valid 'latitude,longitude' format.")

        df['latitude'] = pd.to_numeric(split_data[0], errors='coerce')
        df['longitude'] = pd.to_numeric(split_data[1], errors='coerce')

        # Drop rows with invalid coordinates
        df.dropna(subset=['latitude', 'longitude'], inplace=True)

        # Drop the original geo column
        df = df.drop(columns=[geo_column])

        print('Dataset Info after Geo Split:\n')
        print(df.info())

    except Exception as e:
        print(f"An error occurred during geolocation splitting: {e}")
        raise

    return df
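
A toy run of the same split logic may help to see what happens to malformed values. The sketch below uses a hypothetical three-row frame and writes the steps with assign, so the input frame is not modified; it is an illustration of the technique rather than a replacement for split_geo_coordinates.

```python
import pandas as pd

# Hypothetical rows: two valid coordinate strings and one malformed value.
toy = pd.DataFrame({
    "geo_point_2d": ["-37.8298, 144.9830", "-37.8299, 144.9714", "not available, n/a"],
    "label": ["a", "b", "c"],
})

# Split on the comma; non-numeric pieces become NaN and their rows are dropped.
parts = toy["geo_point_2d"].astype(str).str.split(",", expand=True)
cleaned = (
    toy.assign(
        latitude=pd.to_numeric(parts[0], errors="coerce"),
        longitude=pd.to_numeric(parts[1], errors="coerce"),
    )
    .dropna(subset=["latitude", "longitude"])
    .drop(columns=["geo_point_2d"])
)

print(cleaned)
```

The malformed third row is removed, while the original toy frame still contains all three rows and its geo_point_2d column.
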
In [8]:
def check_preprocess_dataset(df_dataset, dataset_name='dataset'):
    """
    Inspects and preprocesses a dataset:
    - Prints dataset info
    - Checks for missing values
    - Removes duplicate rows (if any)

    Parameters:
    - df_dataset (pd.DataFrame): The input DataFrame to be checked and cleaned.
    - dataset_name (str): Optional name of the dataset for logging purposes.

    Returns:
    - pd.DataFrame: A cleaned version of the input DataFrame.
    """
    try:
        if not isinstance(df_dataset, pd.DataFrame):
            raise TypeError("Input is not a pandas DataFrame.")

        print(f'Dataset Information for "{dataset_name}":\n')
        print(df_dataset.info())

        # Check for missing values
        print(f'\nMissing values in "{dataset_name}" dataset:\n')
        print(df_dataset.isnull().sum())

        # Identify and remove duplicates
        dupes = df_dataset.duplicated().sum()
        if dupes > 0:
            df_dataset = df_dataset.drop_duplicates()
            print(f'\nDeleted {dupes} duplicate record(s) from "{dataset_name}".')
        else:
            print(f'\nNo duplicate records found in "{dataset_name}".')

    except Exception as e:
        print(f"An error occurred while preprocessing '{dataset_name}': {e}")
        raise

    return df_dataset

Insect Records Dataset

Checking for missing values & duplicate records

In [9]:
df_insect_records = check_preprocess_dataset(df_insect_records, 'Insect Records Dataset')
Dataset Information for "Insect Records Dataset":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1295 entries, 0 to 1294
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   taxa                  1295 non-null   object 
 1   kingdom               1295 non-null   object 
 2   phylum                1295 non-null   object 
 3   class                 1295 non-null   object 
 4   order                 1295 non-null   object 
 5   family                1290 non-null   object 
 6   genus                 589 non-null    object 
 7   species               264 non-null    object 
 8   identification_notes  1031 non-null   object 
 9   location              1295 non-null   object 
 10  sighting_date         0 non-null      float64
dtypes: float64(1), object(10)
memory usage: 111.4+ KB
None

Missing values in "Insect Records Dataset" dataset:

taxa                       0
kingdom                    0
phylum                     0
class                      0
order                      0
family                     5
genus                    706
species                 1031
identification_notes     264
location                   0
sighting_date           1295
dtype: int64

No duplicate records found in "Insect Records Dataset".

Filling Missing Taxonomic Information in Insect Records

In the insect dataset, we have some records with missing genus or species information. To improve the completeness of the data for better biodiversity analysis, we'll extract the missing information from the identification notes where possible.

Approach:

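  1. For missing genus names: We take the leading capitalised word of the identification notes as a genus candidate and use it to fill in the blanks. Where the notes hold only a family-level label (e.g. "Pteromalidae 4"), this heuristic writes the family name into the genus column, so filled values are best read as the finest available taxon label rather than a confirmed genus.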

  2. For missing species names: We look for patterns like "sp.1" or "sp 3" in the notes, which are common ways scientists record unidentified species within a genus. We standardise these to a consistent format (e.g., "sp1").

This process helps us create a more complete taxonomic classification, which is key for accurately measuring biodiversity across Melbourne's urban landscape.

In [10]:
def extract_genus_from_notes(notes: str) -> str | None:
    """Extracts a genus-like name from identification_notes if present.
    The candidate is the leading word of the notes, provided it begins with a capital letter.
    Returns None if the notes do not start with such a word.
    """
    if not isinstance(notes, str):
        return None
    # re.match anchors at the start of the string: capture a leading word made of
    # one uppercase letter followed by one or more letters
    match = re.match(r'\b([A-Z][a-zA-Z]+)\b', notes)
    return match.group(1) if match else None

def extract_species_code(notes: str) -> str | None:
    """Extracts the species number code from identification_notes.
    Returns a string of the form 'sp<number>' or None if no number is found.
    """
    if not isinstance(notes, str):
        return None
    # Look for patterns like 'sp.1', 'sp 1', 'sp. 3', etc.
    match = re.search(r'\bsp\.?\s*(\d+)\b', notes, flags=re.IGNORECASE)
    if match:
        number = match.group(1)
        return f"sp{number}"
    return None
In [11]:
# Apply genus extraction
missing_genus_mask = df_insect_records['genus'].isna() | (df_insect_records['genus'].str.strip() == "")
df_insect_records.loc[missing_genus_mask, 'genus'] = df_insect_records.loc[missing_genus_mask, 'identification_notes'].apply(extract_genus_from_notes)

# Apply species code extraction for rows with null species or empty string
missing_species_mask = df_insect_records['species'].isna() | (df_insect_records['species'].str.strip() == "")
df_insect_records.loc[missing_species_mask, 'species'] = df_insect_records.loc[missing_species_mask, 'identification_notes'].apply(extract_species_code)

# display first 5 rows of the updated dataset
df_insect_records[['genus', 'species', 'identification_notes']].head() 
Out[11]:
genus species identification_notes
0 Pteromalidae None Pteromalidae 4
1 Pyrgotidae None Pyrgotidae 1
2 Scenopinidae None Scenopinidae 2
3 Sepsidae None Sepsidae 1
4 Stratiomyidae None Stratiomyidae 2
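
In the rows above, the genus column has been filled with family names such as Pteromalidae, because the notes for those records start with a family-level label. One possible guard, sketched here as an assumption rather than as part of the original workflow, is to reject candidates that carry a zoological rank suffix (-idae for families, -inae for subfamilies, -oidea for superfamilies) or that repeat the value of the family column. The function name extract_genus_guarded and the second sample note are hypothetical.

```python
import re

# Zoological rank suffixes: -idae (family), -inae (subfamily), -oidea (superfamily).
RANK_SUFFIXES = ("idae", "inae", "oidea")


def extract_genus_guarded(notes, family=None):
    """Return the leading capitalised word of `notes`, unless it looks like a
    family-level name or repeats the record's family (hypothetical variant)."""
    if not isinstance(notes, str):
        return None
    match = re.match(r"([A-Z][a-zA-Z]+)\b", notes)
    if not match:
        return None
    candidate = match.group(1)
    # Reject names that end in a family-group suffix.
    if candidate.lower().endswith(RANK_SUFFIXES):
        return None
    # Reject names that simply repeat the family column.
    if isinstance(family, str) and candidate.lower() == family.strip().lower():
        return None
    return candidate


print(extract_genus_guarded("Pteromalidae 4", family="PTEROMALIDAE"))
print(extract_genus_guarded("Iridomyrmex sp. 2"))  # hypothetical note
```

With this guard the first call yields no genus and the second keeps the genus-like word. The suffix test is a heuristic, so a genuine genus that happens to end in one of these suffixes would also be rejected.
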
In [12]:
# get unique location names
locations = df_insect_records['location'].dropna().unique()

print(f"Unique locations in the dataset: {len(locations)}") 
print(locations)
Unique locations in the dataset: 15
['Fitzroy-Treasury Gardens' 'Royal Park' 'Princes Park' 'Lincoln Square'
 'Pleasance Gardens' "Women's Peace Gardens" 'Carlton Gardens South'
 'Westgate Park' 'Canning/Neil Street Reserve' 'Murchinson Square'
 'Argyle Square' 'State Library of Victoria' 'University Square'
 'Gardiner Reserve' 'Garrard Street Reserve']

Correcting some location names so that their geolocations can be retrieved accurately

In [13]:
# Define your mapping:
mapping = {
    'Fitzroy-Treasury Gardens': 'Treasury Gardens',
    "Women's Peace Gardens": 'Peace Gardens',
    'Canning/Neil Street Reserve': 'Canning Street Reserve',
    'Murchinson Square': 'Murchison Square',
    'Garrard Street Reserve': 'Gerard Street Reserve',
}

# Apply it in‐place to the location column:
df_insect_records['location'] = df_insect_records['location'].replace(mapping)

# (Optional) Verify:
print(df_insect_records['location'].unique())

# get unique location names
locations = df_insect_records['location'].dropna().unique()
['Treasury Gardens' 'Royal Park' 'Princes Park' 'Lincoln Square'
 'Pleasance Gardens' 'Peace Gardens' 'Carlton Gardens South'
 'Westgate Park' 'Canning Street Reserve' 'Murchison Square'
 'Argyle Square' 'State Library of Victoria' 'University Square'
 'Gardiner Reserve' 'Gerard Street Reserve']
In [14]:
# geocode each location
geolocator = Nominatim(user_agent="little_things_project")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)  # rate‑limit calls

coords = {}
for loc in locations:
    query = f"{loc}, Melbourne, Victoria, Australia"
    result = geocode(query)
    if result:
        coords[loc] = {'latitude': result.latitude, 'longitude': result.longitude}
    else:
        coords[loc] = {'latitude': None, 'longitude': None}

# map lat/lon back onto the dataset
df_insect_records['latitude'] = df_insect_records['location'].map(lambda x: coords.get(x, {}).get('latitude'))
df_insect_records['longitude'] = df_insect_records['location'].map(lambda x: coords.get(x, {}).get('longitude'))
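
Free-text geocoding can return no match or a match in the wrong place, so it may be worth checking the coords dictionary before relying on it. The sketch below flags locations whose point is missing or falls outside a rough bounding box. The box limits are an assumption chosen for illustration, not an official City of Melbourne boundary, and the function name and the last two sample entries are hypothetical.

```python
# Rough bounding box around the City of Melbourne. These limits are an
# assumption for illustration, not an official boundary.
LAT_MIN, LAT_MAX = -37.86, -37.77
LON_MIN, LON_MAX = 144.89, 145.00


def flag_suspect_coords(coords):
    """Return the location names whose geocoded point is missing or lies
    outside the assumed bounding box."""
    suspect = []
    for name, point in coords.items():
        lat, lon = point.get("latitude"), point.get("longitude")
        if lat is None or lon is None:
            suspect.append(name)
        elif not (LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX):
            suspect.append(name)
    return suspect


sample_coords = {
    "Treasury Gardens": {"latitude": -37.814316, "longitude": 144.975998},
    "Royal Park": {"latitude": -37.781268, "longitude": 144.951682},
    "Unmatched Reserve": {"latitude": None, "longitude": None},          # hypothetical failed lookup
    "Wrong City Park": {"latitude": -33.8688, "longitude": 151.2093},    # hypothetical mismatch
}
print(flag_suspect_coords(sample_coords))
```

Any flagged location can then be corrected by hand or re-queried with a more specific address before the coordinates are mapped back onto the records.
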
In [15]:
df_insect_records.head(5)  # display first 5 rows of the updated dataset with geocoded coordinates
Out[15]:
taxa kingdom phylum class order family genus species identification_notes location sighting_date latitude longitude
0 Insect ANIMALIA ARTHROPODA INSECTA HYMENOPTERA PTEROMALIDAE Pteromalidae None Pteromalidae 4 Treasury Gardens NaN -37.814316 144.975998
1 Insect ANIMALIA ARTHROPODA INSECTA DIPTERA PYRGOTIDAE Pyrgotidae None Pyrgotidae 1 Royal Park NaN -37.781268 144.951682
2 Insect ANIMALIA ARTHROPODA INSECTA DIPTERA SCENOPINIDAE Scenopinidae None Scenopinidae 2 Royal Park NaN -37.781268 144.951682
3 Insect ANIMALIA ARTHROPODA INSECTA DIPTERA SEPSIDAE Sepsidae None Sepsidae 1 Princes Park NaN -37.783751 144.961831
4 Insect ANIMALIA ARTHROPODA INSECTA DIPTERA STRATIOMYIDAE Stratiomyidae None Stratiomyidae 2 Lincoln Square NaN -37.802439 144.962880

Selecting the relevant columns

Selecting the relevant columns and dropping the rest of the columns.

In [16]:
df_insect_records = df_insect_records[['taxa', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus',
       'location','latitude', 'longitude']]
# Print the columns in the updated dataset
print(df_insect_records.columns)
Index(['taxa', 'kingdom', 'phylum', 'class', 'order', 'family', 'genus',
       'location', 'latitude', 'longitude'],
      dtype='object')

Tree Canopies 2021 Dataset

Checking for missing values & duplicate records

In [17]:
df_tree_canopy_2021 = check_preprocess_dataset(df_tree_canopy_2021, 'Tree Canopies 2021')
Dataset Information for "Tree Canopies 2021":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57980 entries, 0 to 57979
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   geo_point_2d  57980 non-null  object
 1   geo_shape     57980 non-null  object
dtypes: object(2)
memory usage: 906.1+ KB
None

Missing values in "Tree Canopies 2021" dataset:

geo_point_2d    0
geo_shape       0
dtype: int64

No duplicate records found in "Tree Canopies 2021".

To facilitate spatial analysis, the geo_point_2d column is split into separate latitude and longitude columns, which are converted to numeric types for further computation and visualisation. The original geo_point_2d column is then dropped to avoid redundancy, leaving a clean, structured dataset ready for spatial analysis and modelling.

In [18]:
#splitting geo coordinates
df_tree_canopy_2021 = split_geo_coordinates(df_tree_canopy_2021,'geo_point_2d')
print('First few rows of the dataset after preprocessing:\n')
df_tree_canopy_2021.head(5)
Dataset Info after Geo Split:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57980 entries, 0 to 57979
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   geo_shape  57980 non-null  object 
 1   latitude   57980 non-null  float64
 2   longitude  57980 non-null  float64
dtypes: float64(2), object(1)
memory usage: 1.3+ MB
None
First few rows of the dataset after preprocessing:

Out[18]:
geo_shape latitude longitude
0 {"coordinates": [[[[144.9832974445821, -37.829... -37.829868 144.983030
1 {"coordinates": [[[[144.9714529379414, -37.829... -37.829875 144.971447
2 {"coordinates": [[[[144.98647926050035, -37.83... -37.830214 144.986467
3 {"coordinates": [[[[144.90116683929529, -37.82... -37.828742 144.901172
4 {"coordinates": [[[[144.96517363556384, -37.82... -37.829921 144.965183

Selecting the relevant columns

Selecting the relevant columns and dropping the rest of the columns.

In [19]:
df_tree_canopy_2021 = df_tree_canopy_2021[['geo_shape', 'latitude', 'longitude']]

# Print the columns in the updated dataset
print(df_tree_canopy_2021.columns)
Index(['geo_shape', 'latitude', 'longitude'], dtype='object')

Urban Forest Dataset

Checking for missing values & duplicate records

In [20]:
df_urban_forest = check_preprocess_dataset(df_urban_forest, 'Urban Forest Dataset')
Dataset Information for "Urban Forest Dataset":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76928 entries, 0 to 76927
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   com_id                        76928 non-null  int64  
 1   common_name                   76903 non-null  object 
 2   scientific_name               76927 non-null  object 
 3   genus                         76927 non-null  object 
 4   family                        76927 non-null  object 
 5   diameter_breast_height        24986 non-null  float64
 6   year_planted                  76928 non-null  int64  
 7   date_planted                  76928 non-null  object 
 8   age_description               24969 non-null  object 
 9   useful_life_expectency        24969 non-null  object 
 10  useful_life_expectency_value  24969 non-null  float64
 11  precinct                      0 non-null      float64
 12  located_in                    76926 non-null  object 
 13  uploaddate                    76928 non-null  object 
 14  coordinatelocation            76928 non-null  object 
 15  latitude                      76928 non-null  float64
 16  longitude                     76928 non-null  float64
 17  easting                       76928 non-null  float64
 18  northing                      76928 non-null  float64
 19  geolocation                   76928 non-null  object 
dtypes: float64(7), int64(2), object(11)
memory usage: 11.7+ MB
None

Missing values in "Urban Forest Dataset" dataset:

com_id                              0
common_name                        25
scientific_name                     1
genus                               1
family                              1
diameter_breast_height          51942
year_planted                        0
date_planted                        0
age_description                 51959
useful_life_expectency          51959
useful_life_expectency_value    51959
precinct                        76928
located_in                          2
uploaddate                          0
coordinatelocation                  0
latitude                            0
longitude                           0
easting                             0
northing                            0
geolocation                         0
dtype: int64

No duplicate records found in "Urban Forest Dataset".

Records with missing taxonomic information (genus, family, or scientific name) are deleted, because these values cannot be reconstructed from the other columns.

In [21]:
# Identify records with missing taxonomic information
missing_taxonomy = df_urban_forest[
    df_urban_forest['genus'].isna() | 
    df_urban_forest['family'].isna() | 
    df_urban_forest['scientific_name'].isna()
]

# Report how many records are affected
print(f"Found {len(missing_taxonomy)} records with missing taxonomic information:")

# Delete the records with missing taxonomic information
df_urban_forest = df_urban_forest.dropna(subset=['genus', 'family', 'scientific_name'])

# Print the delete confirmation
print(f"Deleted {len(missing_taxonomy)} records with missing taxonomic information:")
Found 1 records with missing taxonomic information:
Deleted 1 records with missing taxonomic information:

Selecting the relevant columns¶

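Only the identifier, taxonomy, planting date, and location columns are kept. The remaining columns, several of which are mostly empty (for example precinct and diameter_breast_height), are dropped.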

In [22]:
df_urban_forest = df_urban_forest[['com_id', 'common_name', 'scientific_name', 'genus', 'family',
       'year_planted', 'date_planted',
       'latitude', 'longitude', 'easting', 'northing',
       'geolocation']]
print(df_urban_forest.columns)
Index(['com_id', 'common_name', 'scientific_name', 'genus', 'family',
       'year_planted', 'date_planted', 'latitude', 'longitude', 'easting',
       'northing', 'geolocation'],
      dtype='object')
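
Several of the dropped columns are mostly empty in the missing-value report. As a minimal sketch (not part of the original notebook), the share of missing values per column can be computed and used to flag sparse columns. The helper name `sparse_columns`, the 0.5 threshold, and the demo values are illustrative assumptions.

```python
import pandas as pd

def sparse_columns(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """Return the names of columns whose share of missing values exceeds `threshold`."""
    missing_share = df.isna().mean()  # fraction of missing values per column
    return missing_share[missing_share > threshold].index.tolist()

# Tiny made-up frame for illustration only (not the real dataset)
demo = pd.DataFrame({
    "genus": ["Ulmus", "Platanus", "Eucalyptus", "Acer"],
    "precinct": [None, None, None, None],
    "diameter_breast_height": [35.0, None, None, None],
})
flagged = sparse_columns(demo)
print(flagged)
```

A threshold like this is only a starting point; a sparse column may still be worth keeping if the analysis needs it.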

Building Footprint Dataset¶

Checking for missing values & duplicate records

In [23]:
df_building_footprint = check_preprocess_dataset(df_building_footprint, 'Building Footprint Dataset')
Dataset Information for "Building Footprint Dataset":

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37750 entries, 0 to 37749
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   geo_point_2d             37750 non-null  object 
 1   geo_shape                37750 non-null  object 
 2   footprint_type           37750 non-null  object 
 3   tier                     37750 non-null  int64  
 4   structure_max_elevation  37750 non-null  float64
 5   footprint_max_elevation  37750 non-null  float64
 6   structure_min_elevation  37750 non-null  float64
 7   property_id              37736 non-null  float64
 8   structure_id             37750 non-null  int64  
 9   footprint_extrusion      37750 non-null  float64
 10  footprint_min_elevation  37750 non-null  float64
 11  structure_extrusion      37750 non-null  float64
 12  roof_type                37750 non-null  object 
dtypes: float64(7), int64(2), object(4)
memory usage: 3.7+ MB
None

Missing values in "Building Footprint Dataset" dataset:

geo_point_2d                0
geo_shape                   0
footprint_type              0
tier                        0
structure_max_elevation     0
footprint_max_elevation     0
structure_min_elevation     0
property_id                14
structure_id                0
footprint_extrusion         0
footprint_min_elevation     0
structure_extrusion         0
roof_type                   0
dtype: int64

No duplicate records found in "Building Footprint Dataset".

Analysis of the missing values shows that property_id is missing only for records whose footprint_type is 'Bridge', so these records can be retained as is.

In [24]:
# Count unique values in property_id and structure_id columns
property_count = df_building_footprint['property_id'].nunique()
structure_count = df_building_footprint['structure_id'].nunique()

# Count unique combinations of property_id and structure_id
combined_count = df_building_footprint.groupby(['property_id', 'structure_id']).ngroups

print(f"Number of unique property IDs: {property_count}")
print(f"Number of unique structure IDs: {structure_count}")
print(f"Number of unique property ID and structure ID combinations: {combined_count}")

# Find rows where property_id is null
null_property_mask = df_building_footprint['property_id'].isna()

# Extract structure_ids where property_id is null
structure_ids_with_null_property = df_building_footprint.loc[null_property_mask, 'structure_id'].tolist()

# Print the count and values
print(f"Found {len(structure_ids_with_null_property)} structures with null property IDs:")
print(structure_ids_with_null_property)

# More efficient approach: Filter once for all structure IDs in the list
filtered_df = df_building_footprint[df_building_footprint['structure_id'].isin(structure_ids_with_null_property)]
selected_columns = filtered_df[['structure_id', 'property_id', 'footprint_type']]

# Show results grouped by structure_id
print(f"\nData for all {len(structure_ids_with_null_property)} structures with null property IDs:")
print(selected_columns.sort_values(by='structure_id'))
Number of unique property IDs: 14102
Number of unique structure IDs: 19018
Number of unique property ID and structure ID combinations: 19153
Found 14 structures with null property IDs:
[802056, 802066, 802067, 802059, 802063, 802060, 802057, 802062, 802069, 802061, 802064, 802065, 802068, 802058]

Data for all 14 structures with null property IDs:
       structure_id  property_id footprint_type
1224         802056          NaN         Bridge
14949        802057          NaN         Bridge
32411        802058          NaN         Bridge
8975         802059          NaN         Bridge
12496        802060          NaN         Bridge
22107        802061          NaN         Bridge
14950        802062          NaN         Bridge
8976         802063          NaN         Bridge
22108        802064          NaN         Bridge
22109        802065          NaN         Bridge
1225         802066          NaN         Bridge
1226         802067          NaN         Bridge
30783        802068          NaN         Bridge
14951        802069          NaN         Bridge
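
The same conclusion can be reached with a shorter check. The sketch below (an illustration, not the notebook's own code) counts the rows with a missing value in one column, grouped by another column. The helper name `null_breakdown` and the demo values are assumptions.

```python
import pandas as pd

def null_breakdown(df: pd.DataFrame, null_col: str, by_col: str) -> pd.Series:
    """Count rows where `null_col` is missing, grouped by the values of `by_col`."""
    return df.loc[df[null_col].isna(), by_col].value_counts()

# Made-up rows for illustration only
demo = pd.DataFrame({
    "property_id": [107105.0, None, None, 107100.0],
    "footprint_type": ["Structure", "Bridge", "Bridge", "Structure"],
})
breakdown = null_breakdown(demo, "property_id", "footprint_type")
print(breakdown)
```

If the result contains a single category, the missing values are confined to that category.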

To facilitate spatial analysis, the geo_point_2d column is split into separate latitude and longitude columns. These new columns are converted to numeric types to allow further computation and visualisation. Finally, the original geo_point_2d column is dropped to avoid redundancy, leaving a clean, structured dataset ready for spatial analysis and modelling.

In [25]:
df_building_footprint = split_geo_coordinates(df_building_footprint, 'geo_point_2d')
print('First few rows of the dataset after preprocessing:\n')
df_building_footprint.head(5)
Dataset Info after Geo Split:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37750 entries, 0 to 37749
Data columns (total 14 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   geo_shape                37750 non-null  object 
 1   footprint_type           37750 non-null  object 
 2   tier                     37750 non-null  int64  
 3   structure_max_elevation  37750 non-null  float64
 4   footprint_max_elevation  37750 non-null  float64
 5   structure_min_elevation  37750 non-null  float64
 6   property_id              37736 non-null  float64
 7   structure_id             37750 non-null  int64  
 8   footprint_extrusion      37750 non-null  float64
 9   footprint_min_elevation  37750 non-null  float64
 10  structure_extrusion      37750 non-null  float64
 11  roof_type                37750 non-null  object 
 12  latitude                 37750 non-null  float64
 13  longitude                37750 non-null  float64
dtypes: float64(9), int64(2), object(3)
memory usage: 4.0+ MB
None
First few rows of the dataset after preprocessing:

Out[25]:
                                           geo_shape  footprint_type  tier  structure_max_elevation  footprint_max_elevation  structure_min_elevation  property_id  structure_id  footprint_extrusion  footprint_min_elevation  structure_extrusion  roof_type    latitude   longitude
0  {"coordinates": [[[[144.94650487159868, -37.80...       Structure     1                     24.5                     24.5                     15.0     107105.0        818620                  9.5                     15.0                  9.5       Flat  -37.800370  144.946437
1  {"coordinates": [[[[144.94761503800615, -37.80...       Structure     1                     23.5                     23.5                     13.0     107102.0        805966                 10.5                     13.0                 10.5       Flat  -37.800347  144.947546
2  {"coordinates": [[[[144.94834088635758, -37.80...       Structure     1                     24.0                     24.0                     13.0     107100.0        813265                 10.5                     13.0                 11.0       Flat  -37.800273  144.948243
3  {"coordinates": [[[[144.94818490989783, -37.80...       Structure     1                     25.5                     25.5                     16.0     107100.0        813267                  9.0                     16.0                  9.5       Flat  -37.800557  144.948116
4  {"coordinates": [[[[144.9444642519418, -37.801...       Structure     2                     21.0                     21.0                     14.5     105780.0        804769                  7.0                     14.5                  6.5       Flat  -37.801671  144.944438
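
The `split_geo_coordinates` helper is defined elsewhere in the notebook. As a rough stand-alone sketch of the steps described above, and assuming that `geo_point_2d` holds text of the form `"lat, lon"`, the split could look like this. The function name `split_lat_lon` and the demo row are hypothetical.

```python
import pandas as pd

def split_lat_lon(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Split a 'lat, lon' text column into numeric latitude and longitude columns."""
    parts = df[column].str.split(",", expand=True)  # column 0 = latitude, 1 = longitude
    out = df.drop(columns=[column]).copy()          # drop the original to avoid redundancy
    out["latitude"] = pd.to_numeric(parts[0].str.strip(), errors="coerce")
    out["longitude"] = pd.to_numeric(parts[1].str.strip(), errors="coerce")
    return out

# Made-up single row for illustration (assumed input format)
demo = pd.DataFrame({
    "structure_id": [818620],
    "geo_point_2d": ["-37.800370, 144.946437"],
})
result = split_lat_lon(demo, "geo_point_2d")
print(result.dtypes)
```

With `errors="coerce"`, malformed coordinates become NaN rather than raising, so they can be counted and inspected afterwards.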

Selecting the relevant columns¶

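Only the geometry, footprint type, tier, structure elevation, identifier, and location columns are kept. The footprint elevation, extrusion, and roof type columns are dropped.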

In [26]:
df_building_footprint = df_building_footprint[['geo_shape', 'footprint_type', 'tier', 'structure_max_elevation',
       'structure_min_elevation', 'property_id', 'structure_id', 'latitude',
       'longitude']]
print(df_building_footprint.columns)
Index(['geo_shape', 'footprint_type', 'tier', 'structure_max_elevation',
       'structure_min_elevation', 'property_id', 'structure_id', 'latitude',
       'longitude'],
      dtype='object')

Data Analysis and Visualisation¶

In this section, we explore Melbourne's urban biodiversity using interactive maps and charts. These visualisations help us understand the distribution of trees, insects, and buildings across the city, making it easier for everyone to see where nature thrives and where improvements can be made.

Each visualisation is explained in simple terms, so you can easily interpret what the data shows and how it relates to the health and connectivity of our urban environment.

Tree Canopy Map¶

This map shows the spread of tree canopies across Melbourne. Each green area represents the coverage of tree leaves and branches, which provide shade, cool the city, and support wildlife.

How to read this map:

  • Larger green areas mean more tree cover, which is good for the environment and people.
  • Smaller or missing green areas highlight places that may need more trees or greening.

By looking at this map, you can easily spot which parts of the city are well-covered by trees and which areas could benefit from more planting.

In [34]:
# Convert Tree Canopies dataset into a GeoDataFrame if not already done
if 'geometry' not in df_tree_canopy_2021.columns:
    df_tree_canopy_2021['geometry'] = df_tree_canopy_2021['geo_shape'].apply(lambda x: shape(json.loads(x)))

# Create GeoDataFrame
gdf = gpd.GeoDataFrame(df_tree_canopy_2021, geometry='geometry', crs='EPSG:4326')

# Project to Web Mercator for compatibility with contextily basemaps
gdf_projected = gdf.to_crs(epsg=3857)

# Create the plot
fig, ax = plt.subplots(figsize=(14, 12))

# Plot tree canopy data
gdf_projected.plot(ax=ax, color='green', edgecolor='darkgreen', alpha=0.7, 
                  label='Tree Canopies')

# Add the contextily basemap
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)

# Add title and labels (the data is in EPSG:3857, so the axes are in metres, not degrees)
ax.set_title('Tree Canopy Coverage (2021) - Melbourne', fontsize=14)
ax.set_xlabel("Easting (m, EPSG:3857)", fontsize=12)
ax.set_ylabel("Northing (m, EPSG:3857)", fontsize=12)

# Improve readability
plt.tight_layout()

# Show the plot
plt.show()
[Figure: tree canopy polygons (2021) plotted in green over a CartoDB Positron basemap]
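
The cell above reprojects the canopy polygons from EPSG:4326 (degrees) to EPSG:3857 (Web Mercator, metres) before plotting, so the axis values of the figure are in metres. The sketch below shows the standard spherical Web Mercator formula in plain Python to make that change of units concrete. It is an illustration only; in the notebook the conversion is done by `to_crs`. The function name is an assumption.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by EPSG:3857 (WGS 84 semi-major axis)

def lonlat_to_web_mercator(lon_deg: float, lat_deg: float) -> tuple:
    """Convert longitude/latitude in degrees to Web Mercator x/y in metres."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

# Approximate centre of Melbourne, as used for the maps in this notebook
x, y = lonlat_to_web_mercator(144.96, -37.81)
print(f"x = {x:,.0f} m, y = {y:,.0f} m")
```

Web Mercator distorts areas and distances away from the equator, so it suits basemap display but not area measurement.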

Insect Diversity Charts¶

This chart shows the variety of insects recorded at different survey locations in Melbourne. Each bar represents one location, and its height is the number of distinct insect genera recorded there.

How to read this chart:

  • Taller bars mean more insect genera were recorded, which is generally a sign of a healthier ecosystem.
  • Shorter bars mean fewer genera were recorded, which may mean the area needs more habitat support.

By understanding insect diversity, we can identify places that are rich in life and those that could benefit from more conservation efforts.

In [28]:
# Count unique insect genera per location (genus is used as the unit of diversity)
species_by_location = df_insect_records.groupby('location')['genus'].nunique().sort_values(ascending=False)

plt.figure(figsize=(12,6))
sns.barplot(x=species_by_location.index, y=species_by_location.values, palette='viridis')
plt.xticks(rotation=90)
plt.xlabel('Location')
plt.ylabel('Number of Unique Insect Genera')
plt.title('Insect Diversity Across Melbourne Locations')
plt.tight_layout()
plt.show()
[Figure: bar chart of the number of unique insect genera per location]
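
The count of unique genera measures richness only. A diversity index that also reflects how evenly records are spread across genera is the Shannon index, H = -Σ p_i ln(p_i), where p_i is the share of records belonging to genus i. The sketch below is one possible way to compute it per location. It assumes the `location` and `genus` column names used above; the function name and the demo records are made up for illustration.

```python
import math
import pandas as pd

def shannon_by_location(records: pd.DataFrame,
                        location_col: str = "location",
                        taxon_col: str = "genus") -> pd.Series:
    """Shannon diversity index per location, based on record counts per taxon."""
    counts = records.groupby([location_col, taxon_col]).size()

    def shannon(site_counts: pd.Series) -> float:
        p = site_counts / site_counts.sum()                 # share of records per taxon
        return float(-(p * p.map(math.log)).sum()) + 0.0    # "+ 0.0" turns -0.0 into 0.0

    return counts.groupby(level=location_col).apply(shannon)

# Made-up records: site A has two genera in equal numbers, site B has one genus
demo = pd.DataFrame({
    "location": ["A", "A", "B", "B", "B"],
    "genus": ["Apis", "Bombus", "Apis", "Apis", "Apis"],
})
diversity = shannon_by_location(demo)
print(diversity)
```

A site with a single genus scores 0, and the score rises as more genera are recorded in more even proportions.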

Insect Habitat Density Map¶

This map displays the distribution of insect habitats across Melbourne’s green spaces. Each blue circle marks a site where insect surveys took place.

How to read this map:

  • Larger circles indicate parks or reserves with a higher number of insect observations, pointing to richer biodiversity at those locations.
  • Smaller circles denote sites where fewer insects were recorded, suggesting areas that might benefit from targeted ecological improvements.

By examining this map, you can quickly identify which parts of the city are most supportive of insect life and which areas could be strengthened to enhance urban biodiversity.

In [29]:
# compute density (number of insect records) at each site
density_by_location = df_insect_records.groupby('location').size().rename('density')

# aggregate to one row per location, including coordinates
site_summary = (
    df_insect_records
    .groupby('location')
    .agg({
        'latitude': 'mean', 
        'longitude': 'mean'
    })
    .reset_index()
)

# merge density values into the summary
site_summary = site_summary.merge(
    density_by_location, 
    left_on='location', 
    right_index=True
)

# build the Folium map centred on Melbourne
m = folium.Map(location=[-37.81, 144.96], zoom_start=13)

# add circle markers with radius scaled by density
for _, row in site_summary.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=4 + row['density'] * 0.1,  # adjust multiplier to suit
        tooltip=row['location'],
        fill=True,
        fill_opacity=0.6
    ).add_to(m)

# display map
m
Out[29]:
[Interactive Folium map: insect survey sites, with circle size scaled by the number of records]
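
The marker radius above grows linearly with the record count (`4 + density * 0.1`), so a site with very many records can produce a circle that covers its neighbours. One alternative, sketched here as a suggestion rather than the notebook's method, is to scale the radius with the square root of the count and cap it. The function name and all default values are assumptions to be tuned.

```python
import math

def scaled_radius(count: float, min_radius: float = 4.0,
                  max_radius: float = 30.0, factor: float = 1.5) -> float:
    """Marker radius in pixels: grows with the square root of `count`, capped at `max_radius`."""
    radius = min_radius + factor * math.sqrt(max(count, 0))
    return min(radius, max_radius)

for n in (0, 25, 100, 2500):
    print(n, scaled_radius(n))
```

In the map loop this would replace the `radius=` expression, for example `radius=scaled_radius(row['density'])`.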

Building Footprint Map¶

This map displays the locations and types of building footprints throughout Melbourne. Each footprint is drawn as a small circle at its recorded latitude and longitude, and the circle's colour shows the footprint type.

How to read this map:

  • Each point represents a building or structure.
  • The colour tells you what type of footprint it is, such as a structure, bridge, tram stop, or train platform. Types not listed in the legend are shown in blue.
  • Areas with lots of buildings may act as barriers to wildlife movement, while open spaces can help connect habitats.

By viewing this map, you can understand how buildings are spread out and how they might affect the movement of animals and plants in the city.

In [31]:
# Define colour mapping for footprint types
footprint_colours = {
    'Structure': 'grey',
    'Bridge': 'red',
    'Tram Stop': 'orange',
    'Jetties': 'green',
    'Ramp': 'purple',
    'Toilet': 'pink',
    'Train Platform': 'brown'
}

# Create a base map centred on Melbourne
m_buildings = folium.Map(location=melbourne_coords, zoom_start=16)

for idx, row in df_building_footprint.iterrows():
    # Assign color based on footprint_type, default to 'blue' if not found
    colour = footprint_colours.get(row['footprint_type'], 'blue')
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=0.5,
        color=colour,
        fill=True,
        fill_opacity=0.2,
        popup=f"Type: {row['footprint_type']}"
    ).add_to(m_buildings)

# Add legend using branca.element
legend_html = """
{% macro html(this, kwargs) %}
<div style="
    position: fixed; 
    bottom: 50px; left: 50px; width: 180px; height: 180px; 
    z-index:9999; font-size:14px;
    background: white; border:2px solid grey; border-radius:8px; padding: 10px;">
    <b>Legend</b><br>
    <i style="color:grey;">&#9679;</i> Structure<br>
    <i style="color:red;">&#9679;</i> Bridge<br>
    <i style="color:orange;">&#9679;</i> Tram Stop<br>
    <i style="color:green;">&#9679;</i> Jetties<br>
    <i style="color:purple;">&#9679;</i> Ramp<br>
    <i style="color:pink;">&#9679;</i> Toilet<br>
    <i style="color:brown;">&#9679;</i> Train Platform<br>
    <i style="color:blue;">&#9679;</i> Other
</div>
{% endmacro %}
"""

macro = MacroElement()
macro._template = Template(legend_html)
m_buildings.get_root().add_child(macro)

m_buildings
Out[31]:
[Interactive Folium map: building footprints as points coloured by footprint type, with a legend]
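
To put a number on "areas with lots of buildings", the footprint points can be counted per grid cell. The sketch below bins latitude and longitude into square cells and counts the footprints in each. It is an illustrative assumption, not part of the original analysis: the function name, the 0.005 degree cell size, and the demo coordinates are all made up.

```python
import math
import pandas as pd

def building_counts_per_cell(df: pd.DataFrame, cell_deg: float = 0.005) -> pd.Series:
    """Count footprints per grid cell of `cell_deg` degrees, densest cells first."""
    cells = pd.DataFrame({
        "lat_cell": (df["latitude"] / cell_deg).map(math.floor),
        "lon_cell": (df["longitude"] / cell_deg).map(math.floor),
    })
    return cells.groupby(["lat_cell", "lon_cell"]).size().sort_values(ascending=False)

# Made-up points: the first two fall in the same cell, the third elsewhere
demo = pd.DataFrame({
    "latitude": [-37.8004, -37.8003, -37.8120],
    "longitude": [144.9464, 144.9475, 144.9610],
})
counts = building_counts_per_cell(demo)
print(counts)
```

Cells with high counts are candidates for the barriers mentioned above, although a degree-based grid is only approximately square at Melbourne's latitude.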

Urban Forest Diversity Visualisations¶

This map shows the different types of trees found in Melbourne’s urban forest, with each tree plotted as a point coloured by its botanical family. By looking at the mix of families, we can see which areas have a rich variety of trees and which might need more.

How to read this visualisation:

  • Areas with many different tree families tend to be healthier and to support more insect species.
  • Places with only a few types of trees may be less resilient to pests or climate change.
  • The colours help you compare tree diversity across different neighbourhoods or parks.

Understanding tree diversity helps us plan for a greener, more resilient city that benefits both people and nature.

In [33]:
# replace 'longitude' and 'latitude' with the actual column names if different
if 'longitude' in df_urban_forest.columns and 'latitude' in df_urban_forest.columns:
    gdf = gpd.GeoDataFrame(
        df_urban_forest, geometry=gpd.points_from_xy(df_urban_forest.longitude, df_urban_forest.latitude),
        crs="EPSG:4326"
    )
else:
    raise ValueError("The dataset must contain 'longitude' and 'latitude' columns.")

# Convert GeoDataFrame to Web Mercator (EPSG:3857) for compatibility with basemaps
gdf = gdf.to_crs(epsg=3857)

# Plot the GeoDataFrame, colour coding by 'family'
fig, ax = plt.subplots(figsize=(16, 12))
gdf.plot(ax=ax, column='family', categorical=True, markersize=2,
         legend=True, legend_kwds={'loc': 'center left', 'bbox_to_anchor': (1.05, 0.5),
                                   'title': 'Family', 'fontsize': 6, 'ncol': 2})

# Add a basemap (using CartoDB Positron tiles)
ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)

# Set title and remove axis for a cleaner look
ax.set_title('Urban Forest Trees in Melbourne, Coloured by Family')
ax.set_axis_off()

plt.tight_layout()
plt.show()
[Figure: urban forest trees plotted as points coloured by family over a CartoDB Positron basemap]
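
With many families on one map, the colours alone make it hard to judge how dominant any single family is. A simple complement, sketched here under the assumption that the `family` column is used as above, is the share of trees belonging to each family. The function name and the demo rows are illustrative.

```python
import pandas as pd

def family_shares(trees: pd.DataFrame, family_col: str = "family") -> pd.Series:
    """Share of trees per family, most common family first."""
    return trees[family_col].value_counts(normalize=True)

# Made-up rows for illustration only
demo = pd.DataFrame({
    "family": ["Myrtaceae", "Myrtaceae", "Ulmaceae", "Platanaceae"],
})
shares = family_shares(demo)
print(shares.head(3))
```

A large share for one family points to the low-variety situation described above, where a single pest or climate stress could affect many trees at once.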